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Abstract. Many statistical problems involve mixture models and the 
need for computationally efficient methods to estimate the mixing dis- 
tribution has increased dramatically in recent years. Newton [Sankhya 
Ser. A 64 (2002) 306-322] proposed a fast recursive algorithm for es- 
timating the mixing distribution, which we study as a special case of 
stochastic approximation (SA). We begin with a review of SA, some 
recent statistical applications, and the theory necessary for analysis of 
a SA algorithm, which includes Lyapunov functions and ODE stabil- 
ity theory. Then standard SA results are used to prove consistency of 
Newton's estimate in the case of a finite mixture. We also propose a 
modification of Newton's algorithm that allows for estimation of an 
additional unknown parameter in the model, and prove its consistency. 

Key words and phrases: Stochastic approximation, empirical Bayes, 
mixture models, Lyapunov functions. 



1. INTRODUCTION 

The aim of the present paper is to review the sub- 
ject of stochastic approximation (SA), along the way 
highlighting some recent statistical applications, and 
to explore its relationship with a recent algorithm [26- 
28] for estimating a mixing distribution. 

SA was introduced in [31] as an algorithmic method 
for finding the root of a function h when only noisy 
observations on h are available. It has since devel- 
oped into an important area of systems control and 
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optimization, with numerous applications in statis- 
tics. In Section 2 we give a brief introduction to 
the SA algorithm and review three recent and inno- 
vative statistical applications. The first two [6, 16] 
strengthen the EM and Metropolis algorithms, re- 
spectively, and the third is a versatile Monte Carlo 
integration method, called Stochastic Approxima- 
tion Monte Carlo (SAMC) [21], which can be applied 
in a variety of statistical problems. We demonstrate 
that combining SAMC with the energy-temperature 
duality [18] provides a method for estimating the 
normalizing constant of a density. We then state a 
theorem providing sufficient conditions for almost 
sure convergence of a SA algorithm, which is used 
in Section 3 to study the convergence properties of a 
mixing distribution estimate. For this purpose, the 
necessary stability theory for ordinary differential 
equations (ODEs) is developed. 

Many statistical problems involve modeling with 
latent, or unobserved, random variables, for exam- 
ple, cluster analysis [24] and multiple testing or esti- 
mation with high-dimensional data [1, 8, 9, 34, 36]. 
The distribution of the manifest, or observed, ran- 
dom variables then becomes a mixture of the form 
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where 9 £@ is the latent variable or parameter, and 
/ is an unknown mixing density with respect to the 
measure n on Q. Estimation of / plays a funda- 
mental role in many inference problems, such as an 
empirical Bayes approach to multiple testing. 

For the deconvolution problem, when p{x\6) in 
(1.1) is of the form p{x — 6), asymptotic results for 
estimates of /, including optimal rates of conver- 
gence, are known [10]. A nonparametric Bayes ap- 
proach to Gaussian deconvolution is discussed in [13]. 
For estimating IIj, a Bayesian might assume an a 
priori distribution on /, inducing a prior on ITj via 
the map / i— > IIj. Consistency of the resulting esti- 
mate of Ilf is considered in [3, 11, 12]. 

In Section 3, we describe a recursive algorithm 
of Newton et al. [26-28] for estimating the mix- 
ing density /. This estimate is significantly faster 
to compute than the popular nonparametric Bayes 
estimate based on a Dirichlet process prior. In fact, 
the original motivation [27] for the algorithm was to 
approximate the computationally expensive Bayes 
estimate. The relative efficiency of the recursive al- 
gorithm compared to MCMC methods used to com- 
pute the Bayes estimate, coupled with the similarity 
of the resulting estimates, led Quintana and New- 
ton [29] to suggest the former be used for Bayesian 
exploratory data analysis. 

While Newton's algorithm performs well in exam- 
ples and simulations (see [14, 26-29, 38] and Sec- 
tion 3.3), very little is known about its large-sample 
properties. A rather difficult proof of consistency, 
based on an approximate martingale representation 
of the Kullback-Leibler divergence, is given by Ghosh 
and Tokdar [14] when G is finite; see Section 3.1. In 
Section 3.2, we show that Newton's algorithm can 
be expressed as a stochastic approximation and re- 
sults presented in Section 2.4 are used to prove a 
stronger consistency theorem than in [14] for the 
case of finite B, where the Kullback-Leibler diver- 
gence serves as the Lyapunov function. 

The numerical investigations in Section 3.3 con- 
sider two important cases when Q is finite, namely, 
when / is strictly positive on and when f{9) = 
for some 9 gQ. In the former case, our calculations 
show that Newton's estimate is superior, in terms of 
accuracy and computational efficiency, to both the 
nonparametric MLE and the Bayes estimate. For the 
latter case, when only a superset of the support of 
/ is known, the story is completely different. While 
Newton's estimate remains considerably faster than 
the others, it is not nearly as accurate. 



We also consider the problem where the sampling 
density p{x\9) of (1.1) is of the iovm p{x\9 , ^) , where 
/ is a mixing density or prior for 9, and is an ad- 
ditional unknown parameter. Newton's algorithm is 
unable to handle unknown ^, and we propose a mod- 
ified algorithm, called N -|- P, capable of recursively 
estimating both / and We express this algorithm 
as a general SA and prove consistency under suitable 
conditions. 

In Section 5 we briefiy discuss some additional 
theoretical and practical issues concerning Newton's 
recursive algorithm and the N -|- P. 

2. STOCHASTIC APPROXIMATION 
2.1 Algorithm and Examples 

Consider the problem of finding the unique root ^ 
of a function h{x). If h{x) can be evaluated exactly 
for each x and if h is sufficiently smooth, then var- 
ious numerical methods can be employed to locate 
S^. A majority of these numerical procedures, includ- 
ing the popular Newton-Raphson method, are iter- 
ative by nature, starting with an initial guess xq of 

and iteratively defining a sequence {xn} that con- 
verges to ^ as n — >■ oo. Now consider the situation 
where only noisy observations on h{x) are available; 
that is, for any input x one observes y = h{x) + s, 
where e is a zero-mean random error. This problem 
arises in situations where h{x) denotes the expected 
value of the response when the experiment is run 
at setting x. Unfortunately, standard deterministic 
methods cannot be used in this problem. 

In their seminal paper, Robbins and Monro [31] 
proposed a stochastic approximation algorithm for 
defining a sequence of design points target- 
ing the root of h in this noisy case. Start with 
an initial guess xq. At stage n > 1, use the state 
Xn-i as the input, observe yn = h{xn-i) + Sn, and 
update the guess (a;„_i,y„) i-^ x„. More precisely, 
the Robbins-Monro algorithm defines the sequence 
{xn} as follows: start with xq and, for n > 1, set 

Xn — Xfi—\ -\- Wny-n 

(2.1) 

= Xn-l + Wn{h{Xn-l) + £„}, 

where {Sn} is a sequence of i.i.d. random variables 
with mean zero, and the weight sequence {wn} sat- 
isfies 

(2.2) Wn>0, ^w„ = (X), y^w^ < oo. 

n n 
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While the SA algorithm above works in more gen- 
eral situations, we can develop our intuition by look- 
ing at the special case considered in [31], namely, 
when h is bounded, continuous and monotone de- 
creasing. If < ^, then h[xn) > and we have 

E{Xn+l\Xn) = Xn + Wn+i{h{Xn) +E{en+l)} 
= Xn + Wn+lh{Xn) 
> Xn- 

Likewise, if > then E(x„+i|x„) < x„. This shows 
that the move Xn i— >■ Xn+i will be in the correct di- 
rection on average. 

Some remarks on the conditions in (2.2) are in or- 
der. While < oo is necessary to prove conver- 
gence, an immediate consequence of this condition 
is that Wn ^ 0- Clearly — >■ implies that the ef- 
fect of the noise vanishes as n oo. This, in turn, 
has an averaging effect on the iterates Un- On the 
other hand, the condition X^^ti^n = oo washes out 
the effect of the initial guess xq. For further details, 
see [25]. 

We conclude this section with three simple exam- 
ples of SA to shed light on when and how the algo- 
rithm works. Example 2.1, taken from [19], page 4, 
is an important special case of the Robbins-Monro 
algorithm (2.1) which further motivates the algo- 
rithm as well as the conditions (2.2) on the sequence 
{wn}- Example 2.2 uses SA to find quantiles of a t- 
distribution, and Example 2.3 illustrates a connec- 
tion between SA and empirical Bayes, two of Rob- 
bins's greatest contributions. 

Example 2.1. Let be the cdf of a distribu- 
tion with mean ^. Then estimation of ^ is equivalent 
to solving h{x) = where h{x) = ^ — x. If Zi , . . . , Z„ 
are i.i.d. observations from F^, then the average Z„ 
is the least squares estimate of ^. To see that {Zn} is 
actually a SA sequence, recall the computationally 
efficient recursive expression for : 



(2.3) 



Zr 



Zn-l+n ^{Zn — Zn-l). 



If we let Xn = Zn, Wn = n ^ and y„ = Zn - Z„-i, 
then (2.3) is exactly of the form of (2.1), with {wn} 
satisfying (2.2). Moreover, if e„ = — ^, then we 
can write y„ = h{xn-i) + En- With this setup, we 
could study the asymptotic behavior of x„ using 
the SA analysis below (see Sections 2.3 and 2.4), 
although the SLLN already guarantees Xn^ i a.s. 
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Fig. 1. Sample paths of the three SA sequences {sji} in Ex- 
ample 2.2. The dotted line is the exact 75th percentile of the 
ts distribution. 



Example 2.2. Suppose we wish to find the ath 
quantile of the ti, distribution; that is, we want to 
find the solution to the equation Fy{x) = q, where 
Fy is the cdf of the ty distribution. While there are 
numerous numerical methods available (e.g., Newton- 
Raphson or bijection), we demonstrate below how 
SA can be used to solve this problem. Making use 
of the well-known fact that the distribution is a 
scale-mixture of normals, we can write 



Fy{x)=¥.[^{x\v-'Z)l Z 



where $(x|(T^) is the cdf of the (0, cr^ 
Now, for Zi , Z2 



i.i.d. 



distribution. 

X^, the sequence {un} de- 
fined as yn = oi — (^{xn-i\i'~^Zn) are noisy observa- 
tions of h{xn~i) =a — Fy{xn~i). This h is bounded, 
continuous and monotone decreasing so the Robbins- 
Monro theory says that the sequence defined 
as (2.1) converges to the true quantile, for any ini- 
tial condition xq. For illustration. Figure 1 shows 
the first 1000 iterations of the sequence {x„} for 
a = 0.75, 1/ = 5 and for three starting values xq G 
{0.5,0.75,1.0}. 

Example 2.3. In Section 3 we consider a par- 
ticular recursive estimate and show that it is of the 
form of a general SA. It turns out that the problem 
there can also be expressed as an empirical Bayes 
(EB) problem [30]. In this simple example, we demon- 
strate the connection between SA and EB, both of 
which are theories pioneered by Robbins. Consider 
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Fig. 2. Sample path of the sequence {xn} in Example 2.3. 
The dotted line is the value of ^ used for data generation. 



the simple hierarchical model 

i.i.d. 



and Zj I A, 



Ai,...,A„"~' Exp(^) 

for i = 1, . . . , n, where the exponential rate > is 
unknown. EB tries to estimate ^ based on the ob- 
served data Zi, . . . , Here we consider a recursive 
estimate of ^. Fix an initial guess xq of ^. Assum- 
ing ^ is equal to xq, the posterior mean of Ai is 
{Zi + l)/(xo + 1), which is a good estimate of 
if Xq is close to ^. Iterating this procedure, we can 
generate a sequence 

(2.4) Xi = Xi_i+Wi — - 

Xi^l Xi^i + l_ 

where {wi} is assumed to satisfy (2.2). Let yi de- 
note the quantity in brackets in (2.4) and take its 
expectation with respect to the distribution of Zi: 

^ — X 



(2.5) h{x)=E{yi\xi 



■ x) 



^x{x + 1) 

Then the sequence in (2.4) is a SA targeting 

a solution of h{x) = 0. Since h is continuous, de- 
creasing and h{x) = iff x = ^, it follows from the 
general theory that a;„ ^ ^. Figure 2 shows the first 
250 steps of such a sequence with xq = 1.5. 

The examples above emphasize one important prop- 
erty that h{x) must satisfy, namely, that it must 
be easy to "sample" in the sense that there is a 
function i/(x, z) and a random variable Z such that 
h{x) =E[ff(x,Z)]. Another thing, which is not ob- 
vious from the examples, is that h[x) must have cer- 
tain stability properties. In general, a SA sequence 



need not have a unique limit point. However, condi- 
tions can be imposed which guarantee convergence 
to a particular solution ^ of h[x) = 0, provided that 
^ is a stable solution to the ODE x = h{x). This is 
discussed further in Section 2.3. 

2.2 Applications 

2.2.1 Stochastic approximation EM. The EM al- 
gorithm [7] has quickly become one of the most pop- 
ular computational techniques for likelihood estima- 
tion in a host of standard and nonstandard statisti- 
cal problems. Common to all problems in which the 
EM can be applied is a notion of "missing data." 

Consider a problem where data Y is observed and 
the goal is to estimate the parameter 9 based on its 
likelihood function L{6). Suppose that the observed 
data Y is incomplete in the sense that there is a 
component Z which is missing — this could be ac- 
tual values which are not observed, as in the case 
of censored data, or it could be latent variables, 
as in a random effects model. Let X = {Y^ Z) de- 
note the complete data. Then the likelihood func- 
tion f{z,9) based on the complete data x is related 
to L{9) according to the formula L{9) = J f{z, 0) dz. 
The EM algorithm produces a convergent sequence 
of estimates by iteratively filling in the missing data 
Z in the E-step and then maximizing the simpler 
complete-data likelihood function f{z,9) in the M- 
step. The E-step is performed by sampling the z- 
values from the density 



p{z\9) 



f{z,9)/m, 
0, 



ifL(0)/O, 
if L(^) = 0, 



which is the predictive density of Z, given Y and 9. 

It is often the case that at least one of the E-step 
and M-step is computationally difficult, and many 
variations of the EM have been introduced to im- 
prove the rate of convergence and/or simplify the 
computations. In the case where the E-step cannot 
be done analytically, Wei and Tanner [39] suggest re- 
placing the expectation in the E-step with a Monte 
Carlo integration. The resulting MCEM algorithm 
comes with its own challenges, however; for example, 
simulating the missing data Znj , for j = 1 , . . . , , 
from p{z\9n) could be quite expensive. 

Delyon, Lavielle and Moulines [6] propose, in the 
case where integration in the E-step is difficult or 
intractable, an alternative to the MCEM using SA. 

SAEM Algorithm. At step n, simulate the 
missing data Znj from the posterior distribution 
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p{z\9n), i = 1, ■ ■ • ,mn- Update Qn{0) using 

rrin 

Qn{0) = (1 - U^n)Qn-l(^) + — V log /(Z„,-, 0), 

where {wn} is a sequence as in (2.2). Then choose 
9n+i such that Qn{On+i) > Qn{0) for ah 6* G 6. 

Compared to the MCEM, the SAEM algorithm's 
use of the simulated data Znj is much more efficient. 
At each iteration, the MCEM simulates a new set 
of missing data from the posterior distribution and 
forgets the simulated data from the previous itera- 
tion. On the other hand, note that the inclusion of 
Qn-i{0) in the SAEM update 0n ^ On+i implies all 
the simulated data points contribute. It is pointed 
out in [6] that the SAEM performs strikingly better 
than the MCEM in problems where maximization is 
much cheaper than simulation. 

Delyon, Lavielle and Moulines [6] show, using gen- 
eral SA results, that for a broad class of complete- 
data likelihoods f{z,6) and under standard regular- 
ity conditions, the SAEM sequence {On} converges 
a.s. to the set of stationary points {9 : VL{6) = 0} 
of the incomplete-data likelihood. Moreover, they 
prove that the only attractive stationary points are 
local maxima; that is, saddle points of L{9) are 
avoided a.s. 

2.2.2 Adaptive Markov Chain Monte Carlo. A ran- 
dom walk Metropolis (RWM) algorithm is a spe- 
cific MCMC method that can be designed to sam- 
ple from almost any distribution vr. In this partic- 
ular case, the proposal is q{x,y) = q{x — y), where 
g is a symmetric density. A popular choice of g' is a 
Np{0,T,) density. It is well known that the conver- 
gence properties of Monte Carlo averages depend 
on the choice of the proposal covariance matrix S, 
in the sense that it affects the rate at which the 
generated stochastic process explores the support of 
TT. Trial and error methods for choosing T, can be 
difficult and time consuming. One possible solution 
would be to use the history of the process to suit- 
ably tune the proposal. These so-called adaptive al- 
gorithms come with their own difficulties, however. 
In particular, making use of the history destroys 
the Markov property of the process so nonstandard 
results are needed in a convergence analysis. For 
instance, when the state space contains an atom, 
Gilks, Roberts and Sahu [15] propose an adaptive 
algorithm that suitably updates the proposal den- 
sity only when the process returns to the atom. The 



resulting process is not Markov, but ergodicity is 
proved using a regeneration argument [15]. 

An adaptive Metropolis (AM) algorithm is pre- 
sented by Haario, Saksman and Tamminen [16], which 
uses previously visited states to update the proposal 
covariance matrix S. Introduce a mean ^ and set 
6 = (/i, S). Let {wn} be a deterministic sequence as 
in (2.2). 

AM Algorithm. Fix a starting point zq and 
initial estimates fiQ and Sq. At iteration n > 1 draw 
Zn from Np{zn-i, cT,n-i) and set 

Sn = (1 - Wn)^n~l + Wn{Zn - /X„_i)(2;„ - ;Un-l)', 
;U„ = (1 - Wn)Hn-l + WnZn- 

Note that if Wn = n~^, then /i„ and S„ are the 
sample mean and covariance matrix, respectively, of 
the observations zi,...,Zn- The constant c in the 
AM is fixed and depends only on the dimension d 
of the support of vr. A choice of c which is, in some 
sense, optimal is c = 2A^/d ([32], page 316). 

It is pointed out in [16] that the AM has the ad- 
vantage of starting the adaptation from the very be- 
ginning. This property allows the AM algorithm to 
search the support of vr more effectively earlier than 
other adaptive algorithms. Note that for the algo- 
rithm of [15] mentioned above, the adaptation does 
not begin until the atom is first reached; although 
the renewal times are a.s. finite, they typically have 
no finite upper bound. 

It is shown in [16] that, under certain conditions, 
the stationary distribution of the stochastic process 
{zn} is the target vr, the chain is ergodic (even though 
it is no longer Markovian), and there is almost sure 
convergence to = (/i,r,S7r), the mean and covari- 
ance of the target vr. This implies that, as n ^ oo, 
the proposal distributions in the AM algorithm will 
be close to the "optimal" choice. If H(z,9) = (z — 
fi, {z — n){z — /i)' — S), then the AM is a general 
SA algorithm with 9n = 9n-i + 'WnH{zn,9n-i), and 
Andrieu, Moulines and Priouret [2] extend the work 
in [16] via new SA stability results. 

2.2.3 Stochastic approximation Monte Carlo. Let 
X he a finite or compact space with a dominat- 
ing measure i^. Let p(x) = Kpo{x) be a probability 
density on X with respect to v with possibly un- 
known normalizing constant k > 0. We wish to es- 
timate J fdu, where / is some function depending 
on p or pq. For example, suppose p{x) is a prior 
and g{y\x) is the conditional density of y given x. 
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Then f{x) = g{y\x)p{x) is the unnormahzed poste- 
rior density of x and its integral, the marginal den- 
sity of y, is needed to compute a Bayes factor. 

The following stochastic approximation Monte 
Carlo (SAMC) method is introduced in [21]. Let 
Ai, . . . , Am be a partition of X and let rji = f dv 
for 1 <i <m. Take 77i(0) as an initial guess, and let 
'fji{n) be the estimate of r]i at iteration n > 1. For 
notational convenience, write 

Oni = \og7]i{n) and On = (6'nl, • • • , Onm)' ■ 

The probability vector vr = (tti, . . . , -Km)' will denote 
the desired sampling frequency of the ^i's; that is, 
TTj is the proportion of time we would like the chain 
to spend in A^. The choice of vr is flexible and does 
not depend on the particular partition {^i, . . . , Am}- 

SAMC Algorithm. Starting with initial esti- 
mate for > simulate a sample Zn+i using a 
RWM algorithm with target distribution 

m 

(2.6) K^|^„)oc^/(z)e-^-/^,(z), z£X. 

i=l 

Then set 6n.+i = dn + Wn+i{Cn+i — 71"), where the de- 
terministic sequence {wn} is as in (2.2), and Cn+i = 

{lAAZn+l),---,lA^{Zn+l)y- 

The normalizing constant in (2.6) is generally un- 
known and difficult to compute. However, p{z\6n) is 
only used at the RWM step where it is only required 
that the target density be known up to a proportion- 
ality constant. 

It turns out that, in the case where no Ai are 
empty, the observed sampling frequency TCi of Ai 
converges to vTj . This shows that tTj is independent of 
its probability pdv. Consequently, the resulting 
chain will not get stuck in regions of high probabil- 
ity, as a standard Metropolis chain might. 

The sequence {9n} is a general stochastic approx- 
imation and, using the convergence results of [2], 
Liang, Liu and Carroll [21] show that if no Ai is 
empty and suitable conditions are met, then 

(2.7) 9ni^C + \og fdu-logTTi a.s., 

for 1 < i < m as n— >-oo, for some arbitrary con- 
stant C. Liang, Liu and Carroll [21] point out a lack 
of identifiability in the limit (2.7); that is, C cannot 
be determined from {9n} alone. Additional informa- 
tion is required, such as X^^l??^(?^) = c for each n 
and for some known constant c. 



In Example 2.4, we apply SAMC to estimate the 
partition function in the one-dimensional Ising model. 
In this simple situation, a closed-form expression is 
available, which we can use as a baseline for assess- 
ing the performance of the SAMC estimate. 

Example 2.4. Consider a one-dimensional Ising 
model, which assumes that each of the d particles in 
a system has positive or negative spin. The Gibbs 
distribution on X = {—1, l}'^ has density (with re- 
spect to counting measure z/) 

PTix) = ^e"^(^)/^ ZiT) = Y: e-^(^)/^, 

where T is the temperature, and E is the energy 

function defined, in this case, as E{x) = — Yli=i ^i^i+i- 
The partition function Z(T) is of particular inter- 
est to physicists: the thermodynamic limit F(T) = 
limd^oo log Z{T) is used to study phase transi- 
tions [4]. In this simple case, a closed-form expres- 
sion for Z{T) is available. There are other more com- 
plex systems, however, where no analytical solution 
is available and I'iX) = 2'^ is too large to allow for 
naive calculation of Z{T). 

Our jumping-off point is the energy-temperature 
duality [18] Z{T) = E« where 9.{u) = 

v{x : E{x) = u} is the density of states. We will ap- 
ply SAMC to first estimate J7(n) and then estimate 
Z{T) with a plug-in: 

Z{T) = Y,fl{u)e--'^. 

u 

Note here that a single estimate of il. can be used 
to estimate the partition function for any T, elimi- 
nating the need for simulations at multiple tempera- 
tures. Furthermore, X^„f^('u) = i^{X) = 2*^ is known 
so, by imposing this condition on the estimate we 
do not fall victim to the lack of identifiability men- 
tioned above. Figure 3 shows the true partition func- 
tion Z{T) = 2'^ cosh'^-i(l/r) for d = 10 as well as the 
SAMC estimate Z{T) as a function of T G [1)4], on 
the log-scale, based on n = 1000 iterations. Clearly, 
Z performs quite well in this example, particularly 
for large T. 

2.3 ODE Stability Theory 

The asymptotic theory of ODEs plays an impor- 
tant role in the convergence analysis of a SA al- 
gorithm. After showing the connection between SA 
and ODEs, we briefly review some of the ODE the- 
ory that is necessary in the sequel. 
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^1 1 1 1 1 1 r-^ 

1.0 1.5 a.O 2.S 3.0 3.5 4.0 

T 

Fig. 3. log Z{T) (gray) and SAMC estimate log Z{T) 
(black) in Example 2.4- 



Recall the general SA algorithm in (2.1) given 
by Xn = Xn-i + WnVn- Assume there is a measur- 
able function h such that h{xn-i) =E[y„|xn-i] and 
rewrite this algorithm as 

Xn = Xn-l + Wnh{Xn-l) + WniVn " /l(x„_i)}. 

Define M„ = y-n — h{xn-i)- Then {M„} is a zero- 
mean martingale sequence and, under suitable con- 
ditions, the martingale convergence theorem guar- 
antees that Mfi becomes negligible as n ^ oo, leav- 
ing us with 

Xn = Xn-l + Wnh{Xn~l) + WnMn 
K Xn-l+WnKXn-l). 

But this latter "mean trajectory" is deterministic 
and essentially a finite difference equation with small 
step sizes. Rearranging the terms gives us 

Xn Xn—1 , / N 

= h[Xn-l), 

Wn 

which, for large n, can be approximated by the ODE 
X = h{x). It is for this reason that the study of SA 
algorithms is related to the asymptotic properties of 
solutions to ODEs. 

Consider a general autonomous ODE x = h{x), 
where /i : M'^ ^ M*^ is a bounded and continuous, pos- 
sibly nonlinear, function. A solution x{t) of the ODE 
is a trajectory in M*^ with a given initial condition 
x{0). Unfortunately, in many cases, a closed- form 
expression for a solution x{t) is not available. For 
that reason, other methods are necessary for study- 
ing these solutions and, in particular, their proper- 
ties as t ^ oo. 



Imagine a physical system, such as an orbiting 
celestial body, whose state is being governed by the 
ODE X = h{x) with initial condition x{0) = xq. Then, 
loosely speaking, the system is stable if choosing an 
alternative initial condition x(0) = Xg in a neighbor- 
hood of xq has little effect on the asymptotic prop- 
erties of the resulting solution x{t). The following 
definition makes this more precise. 

Definition 2.5. A point ^ G M"^ is said to be 
locally stable for x = h{x) if for each e > there is a 
(5 > such that if ||a;(0)-C|| < (5, then \\x{t)-^\\ < e 
for all t > 0. If ^ is locally stable and x{t) — )• ^ as 
t —?■ oo, then ^ is locally asymptotically stable. If this 
convergence holds for all initial conditions x(0), then 
the asymptotic stability is said to be global. 

Points ^ for which stability is of interest are equi- 
librium points oi X = h{x). Any point ^ such that 
/i(^) = is called an equilibrium point, since the 
constant solution x{t) = ^ satisfies x = 

Example 2.6. Let x = Ax, where A is a fixed 
d X d matrix. For an initial condition x(0) = xq, we 
can write an explicit formula for the particular so- 
lution: x{t) = e^^xo for t > 0. Suppose, for simplic- 
ity, that A has a spectral decomposition A = UAU' , 
where U is orthogonal and A is a diagonal matrix of 
the eigenvalues Ai, . . . , of yl. Then the matrix ex- 
ponential can be written as e^* = U e^*C/', where e^* 
is diagonal with ith element e'^'*. Clearly, if Aj < 0, 
then e^'* — > as i — )• oo. Therefore, if A is negative 
definite, then the origin x = is globally asymptot- 
ically stable. 

When explicit solutions are not available, proving 
asymptotic stability for a given equilibrium point 
will require a so-called Lyapunov function [20] . 

Definition 2.7. Let ^ G M'^ be an equilibrium 
point of the ODE x = h{x) with initial condition 
x(0) = Xq. A function £ : M*^ ^ R is called a Lyapunov 
function (at ^) if: 

• I has continuous first partial derivatives in a neigh- 
borhood of ^ ; 

• (-{x) > with equality if and only if x = ^; 

• the time derivative of I along the path x(t), de- 
fined as t{x) = Vl{x)'h{x), is < 0. 

A Lyapunov function is said to be strong if £(x) = 
implies x = ^. 
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Lyapunov functions are a generalization of the po- 
tential energy of a system, such as a swinging pen- 
dulum, and Lyapunov's theory gives a formal ex- 
tension of the stability principles of such a system. 
Theorem 2.8 is very powerful because it does not 
require an explicit formula for the solution. See [20] 
for a proof and various extensions of the Lyapunov 
theory. 

Theorem 2.8. // there exists a (strong) Lya- 
punov function in a neighborhood of an equilibrium 
pointy ofx = h[x), then^ is (asymptotically) stable. 

There is no general recipe for constructing a Lya- 
punov function. In one important special case, how- 
ever, a candidate Lyapunov function is easy to find. 
Suppose h{x) = — Vg{x), for some positive definite, 
sufficiently smooth function g. Then l!.[x) = g(x) is 
a Lyapunov function since £{x) = —\\'Vg{x)\\'^ < 0. 

Example 2.9. Consider again the linear system 
X = Ax from Example 2.6, where A is a d x d nega- 
tive definite matrix. Here we will derive asymptotic 
stability by finding a Lyapunov function and apply- 
ing Theorem 2.8. In light of the previous remark, we 
choose i{x) = -\x'Ax. Then (.{x) = < so 

^ is a strong Lyapunov function for x = Ax and the 
origin is asymptotically stable by Theorem 2.8. 

Of interest is the stronger conclusion of global 
asymptotic stability. Note, however, that Theorem 
2.8 does not tell us how far xq can be from the equi- 
librium in question and still get asymptotic stability. 
For the results that follow, we will prove the global 
part directly. 

2.4 SA Convergence Theorem 

Consider, for fixed xq and {wn} satisfying (2.2), 
the general SA algorithm 

(2.8) Xn = VTO]x{^n^l+Wnyn}, ^^ > 1, 

where X C M'^ is compact and Proj ^(2;) is a projec- 
tion of X onto X. The projection is necessary when 
boundedness of the iterates cannot be established by 
other means. The truncated or projected algorithm 

(2.8) is often written in the alternative form [19] 

(2.9) Xn = Xn-l+Wnyn + WnZn: 

where z„, is the "minimum" z such that x„_i -|- 
WnVn + uinZ belongs to X. 

Next we state the main stochastic approximation 
result used in the sequel, a special case of The- 
orem 5.2.3 in [19]. Define the filtration sequence 

= 0-(yi,...,yn). 



Theorem 2.10. For {xn} in (2.8) with {wn} 
satisfying (2.2), assume 

(SAl) sup„E||y„|p < 00. 

(SA2) There exists a continuous function h{-) and 
a random vector fin such that E(y„|^„_i) = 
h{xn-i) + (3n a.s. for each n. 

(SA3) X^n'ii'nll/^nll Converges a.s. 

If ^ is globally asymptotically stable for x = h{x), 
then Xn ^ £, a.s. 

3. NEWTON'S RECURSIVE ESTIMATE 

Let G and X be the parameter space and sam- 
ple space, equipped with cj-finite measures ji and 
v, respectively. Typically, G and X are subsets of 
Euclidean space and is Lebesgue or counting mea- 
sure. The measure n varies depending on the infer- 
ence problem: for estimation, is usually Lebesgue 
or counting measure, but for testing, is often some- 
thing different (see Example 3.1). 

Consider the following model for pairs of random 
variables (Xj, 61*) G <Y x G: 

(3.1) e^'-'^^-f, Xi\e''^%{-\9'), i = l,...,n, 

where {p{-\0) ■ € Q} is a parametric family of prob- 
ability densities with respect to on A" and / is a 
probability density with respect to on G. In the 
present case, the variables (parameters) 6^,...,6"' 
are not observed. Therefore, under model (3.1), 
Xi, . . . , Xn are i.i.d. observations from the marginal 
density 11 j in (1.1). We call / the mixing density 
(or prior, in the Bayesian context) and the inference 
problem is to estimate / based on the data observed 
from Hf. The following example gives a very impor- 
tant special case of this problem — the analysis of 
DNA microarray data. 

Example 3.1. A microarray is a tool that gives 
researchers the ability to simultaneously investigate 
the effects of numerous genes on the occurrence of 
various diseases. Not all of the genes will be ex- 
pressed — related to the disease in question — so the 
problem is to identify those which are. Let 0* repre- 
sent the expression level of the ith. gene, with 0* = 
indicating the gene is not expressed. After some re- 
duction, the data Xi is a measure of 9^, and the 
model is of the form (3.1) with / being a prior den- 
sity with respect to /i = ALcb + (^{o} • Consider the 
multiple testing problem 

Hoi:9'=0, i = l,...,n. 
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The number n of genes under investigation is often 
in the thousands so, with httle information about 0* 
in Xi, choosing a fixed prior / would be problematic. 
On the other hand, the data contain considerable 
information about the prior / so the empirical Bayes 
approach — using the data to estimate the prior — 
has been quite successful [9]. 

In what follows, we focus our attention on a partic- 
ular estimate of the mixing density /. Let xi , . . . , a;„ € 
X be i.i.d. observations from the mixture density 
lif in (1.1). Newton [26] suggests the following al- 
gorithm for estimating /. 

Newton's Algorithm. Choose a positive den- 
sity /o on © and weights wi,...^Wn G (0,1). Then 
for i = 1, . . . , n, compute 

(3.2) /.(^) = (l--.)/.-i(^) + -.^%^^^, 

where ^j{x) = J p{x\6)fj{9)dfi{9), and report fni^) 
as the final estimate. 

In the following subsections we establish some 
asymptotic properties of fn as n — >■ cx) and we show 
the results of several numerical experiments that 
demonstrate the finite-sample accuracy of Newton's 
estimate (3.2) in both the discrete and continuous 
cases. First, a few important remarks. 

• The update i— t- fi in (3.2) is similar to a Bayes 
estimate based on a Dirichlet process prior (DPP), 
given the information up to, and including, time 
i — 1. That is, after observing Xi, X2, . . . , 

a Bayesian might model / with a DPP ^( , 
fi-i). In this case, the posterior expectation is 
exactly the fi in (3.2). 

• Because fn depends on the ordering of the data 
and not simply on the sufficient statistic (-^(1) , • • . , 
X(„)), it is not a posterior quantity. 

• The algorithm is very fast: if one evaluates (3.2) 
on a grid of m points 61,. . . ,9m and calculates the 
integral in Ilj-i using, say, a trapezoid rule, then 
the computational complexity is mn. 

3.1 Review of Convergence Results 

In this section, we give a brief review of the known 
convergence results for Newton's estimate fn in the 
case of a finite parameter space. The case of a com- 
pact is quite different and, until very recently [38], 
nothing was known about the convergence of fn in 
such problems; see Section 5. 

Newton [26], building on the work in [27, 28], 
states the following convergence theorem. 



Theorem 3.2. Assume the following. ■ 
(Nl) Q is finite and /i is counting measure. 

(N2) En^n = 00. 

(N3) p{x\9) is bounded away from and 00. 

Then surely there exists a density f^o on Q such 
that fn — 5- foo as 00. 

Newton [26] presents a proof of Theorem 3.2 based 
on the theory of nonhomogeneous Markov chains. 
He proves that /„ represents the n-step marginal 
distribution of the Markov chain {Zn} given by 



Zq ~ /o, Zn 



Zn-i, with prob 1 — Wr, 
Yn, with prob Wn, 



where Yn has density oc p{Xn\9)fn~i{9). However, 
the claim that this Markov chain admits a station- 
ary distribution is incomplete — N2 implies the chain 
{Zn} is weakly ergodic but the necessary strong er- 
godicity property does not follow, even when is 
finite. Counterexamples are given in [14, 17]. Ghosh 
and Tokdar [14] prove consistency of /„ along quite 
different lines. For probability densities ip and (p 
with respect to fj,, define the Kullback-Liebler (KL) 
divergence, 

(3.3) K{^,ip)= [ il;log{^/^)dfi. 

Je 

The following theorem is proved in [14] using an 
approximate martingale representation of K{f,fn). 

Theorem 3.3. In addition to N1-N3, assume 

(GTl) En<<^- 

(GT2) / is identifiable; that is, / 1-^ Hj is infective. 

Then K{ f, fn) — )• a.s. as 00. 

Part of the motivation for the use of the KL di- 
vergence lies in the fact that the ratio fn/fn-i has 
a relatively simple form. More important, however, 
is the Lyapunov property shown in the proof The- 
orem 3.4. Sufficient conditions for GT2 in the case 
of finite G are given in, for example, [22, 37]. San 
Martin and Quintana [33] also discuss the issue of 
identifiability in connection with the consistency of 

fn- 

3.2 Newton's Estimate as a SA 

Here we show that Newton's algorithm (3.2) is a 
special case of SA. First, note that if / is viewed 
as a prior density, then estimating / is an empirical 
Bayes (EB) problem. The ratio in (3.2) is nothing 
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but the posterior distribution of 9, given Xj, and 
assuming that the prior / is equal to fi^i- This, in 
fact, is exactly the approach taken in Example 2.3 
to apply SA in an EB problem. 

Let n be counting measure and d = ^(0). We can 
think of fn{6) as a vector /„ = (/^, . . . , /^)' in the 
probability simplex A*^, defined as 



i=l 



1 



Define H : X yi /S.'^ ^ with kih. component 



(3.4) Hk{x,^) 



/c = l,...,d. 



where Iiip{x) = YlikP{^\^k)'^^ is the marginal density 
on X induced by (/? € A'^. Then (3.2) becomes 



(3.5) 



fn — fn-l + WnH{Xn, fn-l)- 



Let Px = diag{p(x|6'fc) : k = 1, . . . ,d} be the diagonal 
matrix of the sampling density values and define the 
mapping h : A'^ to be the conditional expecta- 

tion of H{x, fn), given /„ = y?: 



(3.6) 



X 



H{x, ip)Iif (x) dv{x) 



X 



n<^(x) 



Px^dv{x) - if, 



where / = {f^, . . . , f^)' is the true mixing/prior dis- 
tribution. From (3.6), it is clear that / solves the 
equation h{ip) = which implies (i) / is an equilib- 
rium point of the ODE = h{ip), and (ii) that / is 
a fixed point of the map 



T{ip) = h{Lp) + ip- 



n/(x) 
n^(x) 



Px'fdv{x). 



Newton [26], page 313, recognized the importance of 
this map in relation to the limit of fn- Also, the use 
of T in [5, 35] for the /-projection problem is closely 
related to the SA approach taken here. 

We have shown that (3.5) can be considered as a 
general SA algorithm, targeting the solution (f = f 
of the equation h{ip) = in A'^. Therefore, the SA 
results of Section 2.4 can be used in the conver- 
gence analysis. The following theorem is proved in 
Appendix A.l. 

Theorem 3.4. Assume Nl, N2, GTl and GT2. 
If p{-\6) > v-a.e. for each 9, then fn^ f cl-s. 



Remark 3.5. Removal of the boundedness con- 
dition N3 on p{x\9) in Theorem 3.4 extends the con- 
sistency result of [14] to many important cases, such 
as mixtures of normal or gamma densities. 

Remark 3.6. Theorem 3.4 covers the interior 
case (when / is strictly positive) as well as the bound- 
ary case (when /* = for some i). The fact that 
/q > implies fn> for all n suggests that conver- 
gence may be slow in the boundary case. 

3.3 Simulations 

Here we provide numerical illustrations comparing 
the performance of Newton's estimate with that of 
its competitors. We consider a location-mixture of 
normals; that is, p{-\9) is a N{9,a^) density The 
weights are set to be Wi = {i + and the initial 
estimate /o is taken to be a Unit (0) density. For the 
Bayes estimate, we assume a Dirichlet process prior 
/ ~ ^(1, /o) in each example. 

Example 3.7 {Finite Q). In this example, we 
compare Newton's recursive (NR) estimate with the 
nonparametric maximum likelihood (NPML) esti- 
mate and the nonparametric Bayes (NPB) estimate. 
Computation of NR and NPML (using the EM algo- 
rithm) is straightforward. Here, in the case of finite 
Q, we use sequential imputation [23] to calculate 
NPB. Take G = Zn[-4,4], and set cj = 1 in p{x\9). 
We consider two different mixing distributions on 
9: 

I. / = Bin(8,0.6), 
n. / = 0.5(5{„2}+0.5(5{2}. 

We simulate 50 data sets of size n = 100 from the 
models corresponding to mixing densities I, H and 
computing the three estimates for each. Figure 4 
shows the resulting estimates for a randomly chosen 
data set from each model. Notice that NR does bet- 
ter for model I than both NPML and NPB. The 
story is different for model II — both NPML and 
NPB are considerably better than NR. This is fur- 
ther illustrated in Figure 5 where the KL divergence 
K(nf, Hn) on X = M is summarized over the 50 sam- 
ples. We see that NR has a slightly smaller KL num- 
ber than NPML and NPB for model I, but they 
clearly dominate NR for model II. This discrepancy 
is at least partially explained by Remark 3.6; see 
Section 5 for further discussion. We should point 
out, however, that both NPML and NPB take sig- 
nificantly longer to compute than NR, about 100 
times longer on average. 
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Fig. 4. Estimates of mixing densities I and II in Example 3.7. Left column: True f (gray) for model I and the three estimates 
(black). Right column: True f (gray) for model II and the three estimates (black). 
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Fig. 5. Summary of the KL divergence K{Ilf,Iln) for the 
three estimates Tl„ in models I and II in Example 3. 7. 

Example 3.8 {Compact Q). We consider a one- 
and a two-component mixture of beta densities on 
= [0, 1] as the true /: 

I. / = Beta(2,7), 

II. / = 0.33 Beta(3, 30) + 0.67 Beta(4, 4) . 

Let fj = 0.1 be the normal sampHng variance. Again, 
computation of NR is straightforward. To compute 
NPB, the importance samphng algorithm in [38] 
that makes use of a collapsing of the Polya Urn 
scheme is used. Figure 6 shows a typical realiza- 
tion of NR and NPB, based on a sample of size 
n = 100 from each of the corresponding marginals. 
Note that the Bayes estimate does a rather poor 
job here, being much too spiky in both cases. This 
is mainly because the posterior for / sits on dis- 
crete distributions. On the other hand, Newton's es- 
timate has learned the general shape of / after only 
100 iterations and results in a much better estimate 
than NPB. Furthermore, on average, the computa- 
tion time for NR is again about 100 times less than 
that of NPB. 

4. N+P ALGORITHM 

Suppose that the sampling distribution on X is 
parametrized not only by 9 but by an additional 
parameter ^. An example of this is the normal dis- 
tribution with mean 6 and variance ^ = cx^. More 
specifically, we replace the sampling densities p{x\9) 
of Section 3 with p{x\0, ^) where 9 is the latent vari- 
able, and ^ is also unknown. Newton's algorithm 



cannot be used in this situation since 9 does not 
fully specify the sampling density. 

In this section we introduce a modification of New- 
ton's algorithm to simultaneously and recursively es- 
timate both a mixing distribution and an additional 
unknown parameter. This modification, called the 
Newton-l-Plug-in (N-l-P), is actually quite simple — 
at each step we use a plug-in estimate of in the 
update (3.2). We show that the N-l-P algorithm can 
be written as a general SA algorithm and, under 
certain conditions, prove its consistency. 

Let p{x\9,S^) be a two-parameter family of densi- 
ties on X , and consider the model 



(4.1) 



i.i.d. „ 



i.i.d. 



L jx , . . . , -^ir 



,n, 



where / is an unknown density on and the param- 
eter € H is also unknown. The number of replicates 
r > 2 is assumed fixed. Note that (4.1) is simply a 
nonparametric random effects model. 

Assume, for simplicity, that H C M; the more gen- 
eral case H C RP is a natural extension of what fol- 
lows. Let © = {6*1, . . . , 6d} be a finite set and take fi 
to be counting measure on Q. Recall that A*^ is the 
probability simplex. Assume: 

(NPl) ^ G int(Ho), where Hq is a compact and con- 
vex subset of H. 

(NP2) / € int(Ao) where Aq C A'^ is compact and, 
for each ip € Aq , the coordinates ip^,. . . ,ip'^ 
are bounded away from zero. 

The subset Hq can be arbitrarily large so assump- 
tion NPl causes no difficulty in practice. Assump- 
tion NP2 is somewhat restrictive in that / must 
be strictly positive. While NP2 seems necessary to 
prove consistency (see Appendix A. 3), simulations 
suggest that this assumption can be weakened. 

The N-|-P algorithm uses an estimate of at each 
step in Newton's algorithm. We assume here that an 
unbiased estimate is available: 

(NP3) There exists an unbiased estimate Tuqe{x), 
X € X^ , of with variance < oo. 

Later we will replace the unbiased estimate with a 
Bayes estimate. This will require replacing NP3 with 
another assumption. 

At time i = 1, . . . , n, we observe an r- vector Xi = 
{Xii, . . . ,Xij.)' and we compute = ruBE(-'^i)- 
unbiased estimate of ^ based on the entire data Xi , . . . 
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Xn would be the average 
has a convenient recursive expression 

(4.2) ^, = i-\i-l)^,^^ + ii% i 



More importantly, by construction, ^^^^ 
i.i.d. random variables with mean ^ and finite vari- 
ance. It is, therefore, a consequence of the SLLN that 
^ni as defined in (4.2), converges a.s. to ^. While this 
result holds for any unbiased estimate T, an unbi- 
ased estimate T' with smaller variance is preferred, 
since it will have better finite-sample performance. 

Define the mapping : A'^ X Ao X Ho ^ M"' with 
fcth component 



n ^ C'-*'' ) which trary £ ^0- Then for z = 1, . . . , re compute 

Ci = ProjHjri[(i-i)ei-i+e^'^]}, 

^^(") are and produce {fn,£,n) as the final estimate. 



1, 



,re. 



We claim that the N+P algorithm for estimating / 
can be written as a general SA involving the true but 
unknown ^ plus an additional perturbation. Define 
the quantities 



(4.3) Hk{x,^,i;) 



p{x\9k,ij)^p'' 



(4.4) h{U.{)=nH{XnJn-l,iWn-l], 
Pn=nH{Xn,fn-uin)\^n~l] 

(4.5) 

-¥.[H{XnJn-l,iWn^l\, 

where ,'^n-i = c(Xi, . . . ,X„_i), so that 

nH{XnJn-l,inWn-l] = h{fn-l)+Pn. 



for A; = 1, . . . ,d, where (p and ijj denote generic el- 
ements in Ao and Hq, respectively, and p{-\6,ilj) is Now the update i-^- /„ can be written as 
the joint density of an i.i.d. sample of size r from 

p{-\e,^). 



N-l-P Algorithm. Choose an initial estimate 
/o G Aq, weights wi,...^Wn G (0,1), and an arbi- 



(4.6) fn = fn-l + Wn{h{fn-l) + M„ + /3„ + 

where z„ is the "minimum" z keeping /„ in Aq, and 



H{Xni fn-liin) — ^(/n-l) — Pn 
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Fig. 6. Estimates of the mixing densities I and II in Example 3.8. Top row: true f (gray) and NR (black). Bottom row: 
true f (gray) and NPB (black). 
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is a martingale adapted to ^n-\- Notice that (4.6) Proposition 4.3. NP4 holds for H in (4.1). 



is now in a form in which Theorem 2.10 can be ap- 
plied. We win make use of the Law of Iterated Log- 
arithm so define M(t) = (2t loglogt)^/^. The consis- 
tency properties of the N+P algorithm are summa- 
rized in the fohowing theorem. 

Theorem 4.1. Assume Nl, GTl, GT2, NP1~ 
NP3. In addition, assume 

(NP4) -^H{x]^p,il)) is bounded on A''' x Aq x Hq. 
(NP5) ^„ -'^n(n) converges. 

Then (/n, ^n) (/, 6 a.s. asn^oo. 

We now remove the restriction to unbiased esti- 
mates of ^, focusing primarily on the use of a Bayes 
estimate in place of the unbiased estimate. But first, 
let = T{Xi, . . . , Xi) be any suitable estimate of ^ 
based on only Xi, . . . ,Xi. Then replace the N+P 
update fi-i I-)- fi with 

/j = Proj { /i- 1 + (^i , /i- 1 , ) } • 

While this adaptation is more flexible with regard 
to the choice of estimate, this additional flexibility 
does not come for free. Notice that the algorithm is 
no longer recursive. That is, given a new data point 
Xn+i, we need more information than just the pair 
(/n,ln) to obtain (/„+i,|„+i). 

Corollary 4.2. If assumptions NP3 and NP5 
in Theorem 4-1 are replaced by 

(NP3') ||„-e|=0(p„) a.s. as n^oo, 

(NP5') Y.nWnPn<(^, 

then (/n,|n)^(/,0 a.s. asn-^-oo. 

Typically, for Bayes and ML estimates, the rate is 
Pn = n~^/^. Then NP5' holds if, e.g., Wn ~ n~^. 

To illustrate the N+P and its modifled version, 
consider the special case where p{-\0,S,) in (4.1) is 
a normal density with mean 6 and ^ = o"^ is the 
unknown variance. That is, 



Xii,...,X„"'^^-N{9\a'), i = l, , 

Moreover, the statistic Si = Xi is sufficient for the 
mean and the density g{-\0, cr^) of Si is known. There- 
fore, H in (4.3) can be written as 



(4.7) Hk{s,ip,^l^) 



'Ej9{s\0j,'>p)(pi 



for = 1, . . . , d, where g{s\6, ip) is the N{6,tp/r) den- 
sity. Even in this simple example, it is not obvious 
that the function H in (4.7) satisfies NP4. A proof 
of the following proposition is in Appendix A. 3. 



Let So be the Hq defined in the general setup. 
For the N+P, we choose Tube (2^) to be the sample 
variance of x, resulting in the recursive estimate 



(4.5 



i{r — 1) 



fc=l jr = l 

For (T^, take the standard noninformative prior p{cr'^) 
(cr^)~^. Under squared-error loss, the Bayes estimate 
of o"^ based on Xi, . . . , Xi is 



(4.9) 



i[r 



k=ij=i 

1/2N 



Note that \a^ — a \ = 0{n" ' ) a.s. so the conclusion 
of Corollary 4.2 holds if Wn ~ n~^. 

The following example compares three resulting 
estimates for this location mixture of normals prob- 
lem: when o"^ is known, when (4.8) is used with the 
N+P and when (4.9) is used in the modified N+P. 
Convergence of the iterates holds in each case by 
Theorems 3.4 and 4.1 and Corollary 4.2. 

Example 4.4. Let 6 = Zn [-4,4] and take / 
to be a Bin(8,0.5) density on 0. Suppose r = 10, 
n = 100, Wi = {i + 1)"^ and set o"^ = 1.5. For each of 
100 simulated data sets, the three estimates of / are 
computed using Newton's algorithm, the N+P and 
the Bayes modification. Each algorithm produces es- 
timates / and (T^ with which we compute 11^(5) = 
J2j=i di^l^jj^'^)/''- Figure 7 summarizes the 100 KL 
divergences K{Il, U^) for each of the three estimates. 
Surprisingly, little efficiency is lost when an estimate 
of cj^ is used rather than the true value. Also, the 
N+P and the Bayes modification perform compa- 
rably, with the Bayes version performing perhaps 
slightly better on average. Note that no projections 
onto So = [10^"^, 10"^] or 

Ao = {y? G A : / > 10""^, k = l,...,d} 

were necessary in this example. 

5. DISCUSSION 

In this paper, we have used general results in the 
area of stochastic approximation to prove a consis- 
tency theorem for a recursive estimate of a mixing 
distribution/prior in the case of a finite parameter 
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a- 




Unbiased 



Bayes 



Fig. 7. Summary of KL divergences ^^(11,11^) for the three 
algorithms in Example 4-4- 

space 0. It is natural to wonder if this theorem 
can be extended to the case where / is an infinite- 
dimensional parameter on an uncountable space @. 
Very recently, Tokdar, Martin and Ghosh [38] have 
proved consistency of /„ in the infinite-dimensional 
case, under mild conditions. Their argument is based 
on the approximate martingale representation used 
in [14] but applied to the KL divergence K(Ilf,Iln) 
between the induced marginals. Again, there is a 
connection between their approach and the SA ap- 
proach taken here, namely, K{Ilf,Il^) is also a Lya- 
punov function for the associated ODE ip = h{ip). 

In addition to convergence, there are some other 
interesting theoretical and practical questions to con- 
sider. First and foremost, there is the question of 
rate of convergence which, from a practical point 
of view, is much more important than convergence 
alone. We expect that, in general, the rate of conver- 
gence will depend on the support of /o, the weights 
Wn and, in the case of an uncountable 0, the smooth- 
ness of /. Whatever the true rate of convergence 
might be. Example 3.7 (model II) demonstrated that 
this rate is unsatisfactory when the support of / is 
misspecified. For this reason, a modification of the 
algorithm that better handles such cases would be 
desirable. 

Another question of interest goes back to the orig- 
inal motivation for Newton's recursive algorithm. 
To an orthodox Bayesian, any method which per- 
forms well should be at least approximately Bayes. 
Stemming from the fact that the recursive estimate 
and the nonparametric Bayes estimate, with the ap- 
propriate Dirichlet process prior, agree when n = l, 



Newton et al. [26-28] claim that the former should 
serve as a suitable approximation to the latter. Our 
calculations in Section 3.3 disagree. In particular, we 
see two examples in the finite case, one where the re- 
cursive estimate is significantly better and the other 
where the Bayes estimate is significantly better. A 
new question arises: if it is not an approximation 
of the Dirichlet process prior Bayes estimate, for 
what prior does the recursive estimate approximate 
the corresponding Bayes estimate? 

Finally, it should be pointed out that our approach 
to the finite mixture problem is somewhat less gen- 
eral than would be desirable. In particular, we are 
assuming that the support of / is within a known 
finite set of points. In general, however, what is 
known is that the support of / is contained in, say, 
a bounded interval. In this set of grid points 

Q = {9i, . . . , 9m} are chosen to approximate the un- 
known support G* = {91, . . . , of /. Newton's al- 
gorithm will produce an estimate fn on G in this 
case, but it is impossible to directly compare /„ 
to / since their supports G and G* may be en- 
tirely different. There is no problem comparing the 
marginals, however. This leads us to the following 
important conjecture, closely related to the so-called 
/-projections in [5, 35]. 

Conjecture. Let Uf^ and Uf be the marginal 
densities corresponding to fn on G and f on Q* , 
respectively. Then, as n - 



oo, 



infK(n/,n^) 

if 



a.s. 



where ip ranges over all densities on G. 

Despite these unanswered practical and theoret- 
ical questions, the strong performance of Newton's 
algorithm and the N+P algorithm in certain cases 
and, more importantly, their computational cost- 
effectiveness, make them very attractive compared 
to the more expensive nonparametric Bayes esti- 
mate or the nonparametric MLE, and worthy of fur- 
ther investigation. 

APPENDIX: PROOFS 

A.l Proof of Theorem 3.4 

To prove the theorem, we need only show that 
the algorithm (3.5) satisfies the conditions of Theo- 
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rem 2T0. First note that /„ is, for each n, a convex 
combination of points in the interior of A"^ so no 
projection as in (2.8) is necessary. Second, the ran- 
dom variables /3n in assumption SA2 are identically 
zero so SA3 is trivially satisfied. 

Let {un} be a convergent sequence in A'^, where 
Un = {u\, . . . ^uf^y . The limit u = {v} , . . . ,u'^)' = 
lim^^oo'Un also belongs to A so h{u) is well de- 
fined. To prove that h = (hi, . . . , /i^)' is continuous, 
we show that hk{un) — > hf:{u) for each k = 1, . . . ,d 
as n — > oo. Consider 



hk{Un) 



Iif{x) du{x) - 



The integrand p{-\6k)un/Ilu„{-) is nonnegative and 
bounded v-a.e. for each k. Then by the bounded 
convergence theorem we get 

lim = /ifc(u), k = l,...,d. 

n— >oo 

But {un} C A'^ was arbitrary so h is continuous. 

Next, note that H{x, fn) is the difference of two 
points in A*^ and is thus bounded independent of 
X and n. Then SAl holds trivially. 

Finally, we show that / is globally asymptoti- 
cally stable for the ODE ip = h{ip) in A'^. Note that 
Yli=i 9^* = Yli=i = so the trajectories lie on 
the connected and compact A''. Let i{ip) be the KL 

divergence, = Ylk=i f'^^'^sif'^ /^'')- We claim 
that ^ is a strong Lyapunov function for ip = h{ip) 
at /. Certainly i{(p) is positive definite. To check the 
differentiability condition, we must show that 
has a well-defined gradient around /, even when / 
is on the boundary of A*^. Suppose, without loss 
of generality, that f^,. . . , are positive, I < s < d, 
and the remaining f^~^^, ■ ■ ■ , are zero. By defini- 
tion, £{ip) is constant in ip'^'^^, • • • and, therefore, 
the partial derivatives with respect to those ip's are 
zero. Thus, for any 1 < s < d and for any such 
that i{{p) < oo, the gradient can be written as 

(A.l) V%) = -{r\r\ + r'I'„ 

where r^ = /(p^ and Is is a vector whose first s 
coordinates are one and last d — s coordinates are 
zero. The key point here is that the gradient of i{(p), 
for ip restricted to the boundary which contains /, is 
exactly (A.l). We can, therefore, extend the defini- 
tion of V£((/?) continuously to the boundary if need 
be. 



Given that \/l{ip) exists on all of A'^, the time 
derivative of £ along ip is 



(A.2) 



i{^) = Vl{^)'h{^) 

n/(x)-ny(x) 
n^(x) 



Vl{p>)'P^pdv{x) 



lif dv. 



It remains to show that i{ip) = iff (/9 = / . Applying 
Jensen's inequality to y i— >■ in (A.2) gives 



i{ip) 



(A.3) 



n 



n7 



< 1 



n, 



Hf du 



-1 



0, 



where equality can hold in (A.3) iff = i^-a.e. 
We assume the mixtures are identifiable, so this im- 
plies Lp = f. Therefore, l{(p) =0 iff ip = f, and we 
have shown that ^ is a strong Lyapunov function 
on A'^. To prove that / is a globally asymptotically 
stable point for (p = h{(p), suppose that ip{t) is a so- 
lution, with p{0) = fo, that does not converge to /. 
Since ^ is a strong Lyapunov function, the sequence 
i{(p{t)), as t — )■ oo, is bounded, strictly decreasing 
and, thus, has a limit A > 0. Then the trajectory 
p{t) must fall in the set 

A* = {99eA'^:A<%)<£(/o)} 

for all t > 0. In the case / € int(A'^), £{ip) — >■ oo as 
p — )■ 9A, so the set A* is compact (in the relative 
topology). If / € dA'^, then A* is not compact but, 
as shown above, i{ip) is well defined and continuous 
there. In either case, £ is continuous and bounded 
away from zero on A*, so 

sup i{(p) = -L < 0. 
Then, for any r > 0, we have 

ii^ir)) = £(/o) + r iiipis)) ds < £(/o) - Lt. 



If r > £{fo)/L, then £{ip{t)) < 0, which is a con- 
tradiction. Therefore, p}{t) — )• / for all initial condi- 
tions ip{Q) = /o, so / is globally asymptotically sta- 
ble. Theorem 2.10 then implies fn^ f a-s. 
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A. 2 Proof of Theorem 4.1 

The proof of the theorem requires the followmg 
lemma, estabhshing a Lipschitz-type bound on the 
error terms /3„ in (4.5). Its proof follows immediately 
from NP4 and the Mean Value Theorem. 

Lemma A.l. Under the assumptions of Theo- 
rem 4-1, there exists a number A € (0,oo) such that 



3„ii<AE(ien-eii^ 



Proof of Theorem 4.1. The map h in (4.4) 
has A:th component 



n/.g(x) 

n^,c(2;) 



p{x\ek,i)v''dv^{x)-<f^ 



where Iif^^{x) = '}ZkP^'^\^k^O is the marginal den- 
sity of X and u'' is the product measure on X^' . No- 
tice that this /i, which does not depend on the esti- 
mate is exactly the same as the h in (3.6). There- 
fore, the continuity and stability properties derived 
in the proof of Theorem 3.4 are valid here as well. 
All that remains is to show that the P^s in (4.5) 
satisfy SA3 of Theorem 2.10. 

By the SLLN, ^„ belongs to Hq for large enough n 
so we can assume, without loss of generality, that no 

projection is necessary. Let Sn = Zi-\ \-Zn-, where 

the Zi = ?;"^(^(*) — ^) and v"^ is the variance of ^^"^^ 
Then — ^| = cn~^\Sn\, where c> is a constant 
independent of n. Since Sn is a sum of i.i.d. random 
variables with mean zero and unit variance, the Law 
of Iterated Logarithm states that 



(A.4) 



limsup{|5„|/ii(n)} = 1 a.s. 



Now, by Lemma A.l and (A.4) we have 

WnW < Acn"^E{\Sn\ I ^„-i) = 0{n-\{n)) 

and, therefore, J2n'^n\\f3n\\ converges a.s. by NP5. 
Condition SA3 is satisfied, completing the proof. □ 

A.S Proof of Proposition 4.3 

To prove that the case of a location-mixture of 
normals with unknown variance is covered by The- 
orem 4.1, we must show that the function H, de- 
fined in (4.7), satisfies NP4, that is, that the partial 
derivatives -^Hi^{s]ip,il)) are bounded. 



Proof of Proposition 4.3. Clearly each com- 
ponent of H, defined in (4.7), is differentiable 
with respect to ^ G Sq and, after simplification. 



d 

—Hk{s,^,il^) 



99 e 



Y,jUkj{s)ipie 



[Y^.^je-'-^V^^e''^^/'^f 

where (as \s\ — t- 00) 

(A.5) Ukj{s) = el- + 2s{ej - Ok) = 0{\s\). 

This derivative is continuous on s{X'^) x Aq x Sq 
and, since Aq and Sq are compact, we know that 

d 



(A.6) Ak{s) :-- 



sup sup 

ipeAoj/^GSo 



dtp 



is finite for all s G s(A"') and for all k. By the Mean 
Value Theorem, 

\Hk{s;ip,tl;) - Hk{s;ip,a'^)\ < ^^.(s)!^' - cj^|. 

It remains to show that Ak{s) is bounded in s. For 
notational simplicity, assume that f and ^ are the 
values for which the suprema in (A.6) are attained. 
Making a change of variables y = rs/ip we can, with 
a slight abuse of notation, write 

™ - • 

We must show that Af^^y) is bounded as \y\ — >■ 00. 
Assume, without loss of generality, that the 6's are 
arranged in ascending order: 9i < 02 < ■ • ■ < 0^- Fac- 
toring out, respectively, e^^^ and e^^'*, we can write 

Ck^''ey(''^-'^')ZiMy)\^'ey('^-'^^ 



Ak{y) < 



Ckip'^ey'~'^-^^^Z.\ukj{y)\v^ey('^ 



dMipy{Sj-ei)+y{e,-ea) ' 



Note that since € Aq, each ip^ is bounded away 
from 0. If y — )■ —00, then the term e^^^*-"^^) — )■ 
dominates the numerator of the first inequality, while 
the denominator is bounded. Similarly, if y — >■ +00, 
then the term e^^^''"^''^ — )■ dominates the numer- 
ator in the second inequality, while the denomina- 
tor is bounded. For the case k = 1 or k = d, note 
that |iiii(y)| = \uiid{y)\ = 0, so the two inequalities 
can still be applied and a similar argument shows 
Ai and Ad are also bounded. Therefore, Ak{y) is 
bounded for each k and the claim follows by taking 
A to be max{sup„ Ak{y) :! < k < d}. □ 
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